I want to thank my mentors for their valuable input to this project.
Dr. Blanca Himes, Associate Professor, Biostatistics, Epidemiology and Informatics;
Dr. Jesse Hsu, Assistant Professor, Biostatistics, Epidemiology and Informatics;
Sherrie Xie, VMD-PhD Candidate, Epidemiology
My project aims to summarize and visualize trends in the use of animals in biomedical research in the US by species, pain level, and state over the period from 2008 to 2019. The data are obtained from the US Department of Agriculture’s APHIS annual reports. Following this exploratory analysis, a prediction for the number of animals that will be used over the next 5 years will be made.
The use of animals in biomedical research is ubiquitous. There are two main categories of animal use: basic biomedical research (mimicking a human disease in an animal model and studying gene expression, molecular mechanisms, etc.) and drug testing (toxicity and efficacy of an experimental drug). A growing body of studies and voices from the scientific community have pointed out the poor reliability and predictive value of animal models for human outcomes and for understanding human physiology. In 2004, the FDA estimated that 92% of drugs that pass preclinical tests in animal models fail in human clinical trials. More recent analyses suggest that, in spite of efforts to improve the predictability of animal testing, the failure rate has increased to 96%.
In 1938, Congress passed the U.S. Federal Food, Drug, and Cosmetic Act, mandating animal toxicity testing. On October 7, 2021, Congress introduced the bipartisan FDA Modernization Act to end animal testing mandates. This comes after the European Parliament resoundingly passed a resolution on September 16, 2021, by a vote of 667 to 4, to phase out animal testing. How has animal use in research changed over the past few years in the US? What is the significance of species, state, and pain level? Using the limited USDA data, I aim to answer these questions. Furthermore, a regression model will predict the number of animals that will be needed in the next few years if current trends continue. This analysis could be of interest to biomedical companies seeking to reduce the time and resources spent on animal models, to the FDA for revising regulatory requirements, and to bioethics specialists and animal advocacy groups.
The trends in animal use could be influenced by technologies that made building animal models easier, technologies that performed better than animal models, changes in bioethical standards, shifts in public sentiment, and even opinions voiced by the heads of government regulatory bodies. I am personally interested in this analysis because one of my career goals is to work on developing and commercializing alternatives to animal methods in biomedical research.
options(scipen = 999)  # avoid scientific notation on axis labels and output
knitr::opts_chunk$set(echo = TRUE)
library("tidyverse")   # loads dplyr, ggplot2, purrr, tidyr, stringr
library("gridExtra")
library("sf")
library("spData")
library("mapview")
library("leaflet")
USDA publishes an annual report of the number of animals used by state, species, and pain category. Each year’s report is published in January two years later, so the latest data I could obtain was from 2019. The reports have been published since 2008, giving me data for a 12-year period.
The data is in the form of several PDFs, which I cleaned manually into .txt files. The naming convention is year_col_X, where X is the pain category:
- Column B: Animals held by a facility but not used in any research that year
- Column C: Animals used in research; no pain involved; no pain drugs administered
- Column D: Animals used in research; pain involved; pain drugs administered
- Column E: Animals used in research; pain involved; no pain drugs administered
The following code chunk loads the data as a dataframe, stacked by year and column. The loaded data has 53 rows per year per column: the 50 states, the District of Columbia, Puerto Rico, and a “REPORT TOTAL” row, which is the sum over all states. Note that the species columns are not an exhaustive list of all animals used in biomedical research, only those covered by the Animal Welfare Act. The AWA does not cover rats, mice, birds, or reptiles, which comprise over 95% of all animals used. Columns B, C, D, and E are assigned values 0, 1, 2, and 3 to reflect pain levels in a more intuitive way.
#Loading the data: for each year and pain-category column, read the tab-separated
#file, strip thousands separators, and stack everything into one data frame
datafr <- purrr::map_dfr(
.x = 2008:2019,
.f = function(year){
purrr::map_dfr(
.x = c("B", "C", "D", "E"),
.f = function(col){
dat <- read.table(file = paste0("C:/Users/Deeksha Hegde/Downloads/BMIN503_Final_Project/USDA_Data/", year, "_col_", col, ".txt"), header = TRUE, sep = "\t")
dat %>%
mutate(across(.cols = All_Other_Covered_Species:Total, .fns = ~ as.numeric(str_remove_all(string = .x, ","))),
Year = year,
Column = col
)
}
)
}
)
#Checking if the data loaded is correct
datafr %>% count(Year, Column)
## Year Column n
## 1 2008 B 53
## 2 2008 C 53
## 3 2008 D 53
## 4 2008 E 53
## 5 2009 B 53
## 6 2009 C 53
## 7 2009 D 53
## 8 2009 E 53
## 9 2010 B 53
## 10 2010 C 53
## 11 2010 D 53
## 12 2010 E 53
## 13 2011 B 53
## 14 2011 C 53
## 15 2011 D 53
## 16 2011 E 53
## 17 2012 B 53
## 18 2012 C 53
## 19 2012 D 53
## 20 2012 E 53
## 21 2013 B 53
## 22 2013 C 53
## 23 2013 D 53
## 24 2013 E 53
## 25 2014 B 53
## 26 2014 C 53
## 27 2014 D 53
## 28 2014 E 53
## 29 2015 B 53
## 30 2015 C 53
## 31 2015 D 53
## 32 2015 E 53
## 33 2016 B 53
## 34 2016 C 53
## 35 2016 D 53
## 36 2016 E 53
## 37 2017 B 53
## 38 2017 C 53
## 39 2017 D 53
## 40 2017 E 53
## 41 2018 B 53
## 42 2018 C 53
## 43 2018 D 53
## 44 2018 E 53
## 45 2019 B 53
## 46 2019 C 53
## 47 2019 D 53
## 48 2019 E 53
datafr <- mutate(datafr, pain.level = factor(Column, levels = c("B", "C", "D", "E"), labels = c(0, 1, 2, 3)))
head(datafr)
## State All_Other_Covered_Species Cats Dogs Guinea_Pigs Hamsters Marine.Mammals
## 1 AK 426 0 6 0 190 0
## 2 AL 132 34 110 0 0 0
## 3 AR 92 69 22 15 0 0
## 4 AZ 144 8 26 0 0 0
## 5 CA 1870 409 118 735 60 0
## 6 CO 358 253 4 423 103 0
## Nonhuman_Primates Other_Farm_Animals Pig Rabbits Sheep Total Year Column
## 1 0 0 0 0 0 622 2008 B
## 2 0 612 4 321 6 1219 2008 B
## 3 34 0 0 0 0 232 2008 B
## 4 38 1 12 32 3 264 2008 B
## 5 6009 4473 45 4438 179 18336 2008 B
## 6 0 0 0 0 0 1141 2008 B
## pain.level
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Now let us visualize the numbers!
The first plot shows the total number of animals (all pain categories combined) by year. The second plot shows the same totals broken down by pain category.
data.total <- filter(datafr, State == "REPORT TOTAL")
#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))
ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Total by year and column
total.plot <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
total.plot
We see a huge rise in the total from 2009 to 2010, and a steady decline since then. Overall, there is a substantial decline over the 12-year period.
From the second plot, we see that pain category 1, which involves no pain and no pain drugs, is the highest in magnitude throughout the study period. We also see that the decline in the total animals is mainly due to the decline in animals at pain level 1.
Now we will examine the total numbers by state.
#Total by column by state
data.by.state <- datafr %>%
filter(State != "REPORT TOTAL") %>%
group_by(State, pain.level) %>%
summarise(Total = sum(Total))
data.by.state.total <- group_by(data.by.state, State) %>%
summarize(Total.all = sum(Total))
ggplot(data.by.state.total, aes(x = State, y = Total.all)) + geom_bar(stat = "identity") + ggtitle("Total number of animals used by state") + ylab(label = "Total across all pain levels in hundred thousands") + scale_y_continuous(breaks = seq(100000, 1300000, by = 100000), labels = seq(1, 13, 1)) +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
axis.text.y = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
ggplot(data.by.state, aes(x = State, y = Total, color = pain.level)) + geom_point(size = 4) + ggtitle("Total number of animals used by state by pain level") + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + ylab(label = "Total in hundred thousands") +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
axis.text.y = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
From the graphs, we get the following information:
- States using the highest number of animals overall, across all pain levels: CA, MA, NJ
- States using the highest number of animals at each pain level:
  - Level 0: CA, MD, TX
  - Level 1: CA, MA, OH
  - Level 2: CA, TX, MA
  - Level 3: MO, MI, IA
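These rankings can also be derived programmatically. A minimal base-R sketch, assuming a data frame shaped like data.by.state above (columns State, pain.level, Total); the helper name top_states is hypothetical, not part of the analysis:

```r
# Return the n states with the largest Total within each pain level
# (top_states is an illustrative helper, not used elsewhere in this analysis)
top_states <- function(df, n = 3) {
  lapply(split(df, df$pain.level), function(d) {
    head(d$State[order(d$Total, decreasing = TRUE)], n)
  })
}

# toy data to illustrate the shape of the result
toy <- data.frame(State = c("CA", "TX", "MD", "OH"),
                  pain.level = rep(0, 4),
                  Total = c(10, 5, 8, 1))
top_states(toy)  # $`0` is c("CA", "MD", "TX")
```

Applied to data.by.state, this reproduces the per-level rankings listed above.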
An arguably better representation of the same data is the use of interactive maps.
usa <- st_read("C:/Users/Deeksha Hegde/Downloads/us-state-boundaries/us-state-boundaries.shp") %>%
rename(State = stusab)
## Reading layer `us-state-boundaries' from data source
## `C:\Users\Deeksha Hegde\Downloads\us-state-boundaries\us-state-boundaries.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS: WGS 84
head(usa)
## Simple feature collection with 6 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.8485 ymin: 14.03656 xmax: 146.1544 ymax: 49.00241
## Geodetic CRS: WGS 84
## gid arealand division intptlat
## 1 16 278176477 0 18.21765
## 2 23 472276664 0 14.93678
## 3 31 1627312771 7 34.89553
## 4 35 2136109036 5 38.64729
## 5 40 -1616974352 1 41.59742
## 6 54 312831514 9 47.40732
## name objectid areawater intptlon
## 1 Puerto Rico 50 628200285 -66.41080
## 2 Commonwealth of the Northern Mariana Islands 36 349301029 145.60102
## 3 Arkansas 44 -1334552525 -92.44463
## 4 West Virginia 1 489848791 -80.61833
## 5 Rhode Island 6 1323457457 -71.52727
## 6 Washington 20 -324557627 -120.57580
## oid funcstat centlon State state statens centlat
## 1 303146031 A -66.41476 PR 72 01779808 18.21647
## 2 -1625647860 A 145.59687 MP 69 01779809 16.79744
## 3 266078934 A -92.44270 AR 05 00068085 34.89402
## 4 -1929409300 A -80.62346 WV 54 01779805 38.64119
## 5 -1861167639 A -71.52472 RI 44 01219835 41.59403
## 6 -1859906639 A -120.59557 WA 53 01779804 47.41490
## basename mtfcc region lsadc geoid
## 1 Puerto Rico G4000 9 00 72
## 2 Commonwealth of the Northern Mariana Islands G4000 9 00 69
## 3 Arkansas G4000 3 00 05
## 4 West Virginia G4000 3 00 54
## 5 Rhode Island G4000 1 00 44
## 6 Washington G4000 4 00 53
## geometry
## 1 MULTIPOLYGON (((-67.20794 1...
## 2 MULTIPOLYGON (((145.5726 15...
## 3 MULTIPOLYGON (((-94.55218 3...
## 4 MULTIPOLYGON (((-81.74725 3...
## 5 MULTIPOLYGON (((-71.7897 41...
## 6 MULTIPOLYGON (((-123.2479 4...
to_map <- inner_join(usa, data.by.state.total, by = "State")
pal_fun <- colorNumeric("BuPu", NULL)
pu_message <- paste0(to_map$State,
"<br>Number of animals used: ",
prettyNum(to_map$Total.all, big.mark = ","))
leaflet(to_map) %>%
addPolygons(stroke = FALSE, # remove polygon borders
fillColor = ~pal_fun(Total.all),
fillOpacity = 0.8, smoothFactor = 0.5,
popup = pu_message) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addLegend("bottomright",
pal=pal_fun,
values=~Total.all,
title = 'Total number of animals used',
opacity = 1) %>%
addScaleBar()
pl0 <- filter(data.by.state, pain.level == 0)
to_map_0 <- inner_join(usa, pl0, by = "State")
pal_fun <- colorNumeric("BuPu", NULL)
pu_message <- paste0(to_map_0$State, # state abbreviation for the popup
"<br>Number of animals used: ", # <br> forces a new line
prettyNum(to_map_0$Total, big.mark = ","))
leaflet(to_map_0) %>%
addPolygons(stroke = FALSE, # remove polygon borders
fillColor = ~pal_fun(Total),
fillOpacity = 0.8, smoothFactor = 0.5, # increase opacity and resolution
popup = pu_message) %>%
addProviderTiles(providers$CartoDB.Positron) %>% # third-party base map tiles
addLegend("bottomright", # location of legend
pal=pal_fun, # palette function
values=~Total, # variable to be passed to palette function
title = 'Total number of animals used at pain level 0', # legend title
opacity = 1) %>% # legend opacity (1 = completely opaque)
addScaleBar()
Now, since we’re in Pennsylvania, let us see what the plots for PA look like.
#Total for PA
data.PA <- filter(datafr, State == "PA")
ggplot(data = summarise(group_by(data.PA, Year), Total = sum(Total)), aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year in Pennsylvania")
#Total by column for PA
ggplot(data = data.PA, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) +
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Pennsylvania")
We see a huge decline from 2009 to 2015.
Next, I want to find out which states contributed the most to the decline observed in the first plot. I calculated the mean and standard deviation (SD) of each state’s yearly totals and sorted the states in decreasing order of SD. Noticing that Arizona has an SD higher than its mean, which raised suspicion, I also calculated the coefficient of variation (CV = SD/mean) and sorted in decreasing order of CV.
#Std dev of total for all states
data.by.state.year <- datafr %>%
filter(State != "REPORT TOTAL") %>%
group_by(State, Year) %>%
summarise(Total = sum(Total))
state.std.dev <- summarise(data.by.state.year, mean = mean(Total), sd = sd(Total))
state.std.dev[order(state.std.dev$sd, decreasing = TRUE),]
## # A tibble: 52 x 3
## State mean sd
## <chr> <dbl> <dbl>
## 1 AZ 16261. 35984.
## 2 NY 45696. 25676.
## 3 CA 103133. 22675.
## 4 KS 11863. 20574.
## 5 PA 50591. 20247.
## 6 OH 64393. 18341.
## 7 IA 30172 15683.
## 8 NJ 67210. 13946.
## 9 MO 36277. 13905.
## 10 MD 58124. 13290.
## # ... with 42 more rows
state.std.dev <- state.std.dev %>% mutate(cv = sd/mean) %>%
arrange(desc(cv))
state.std.dev
## # A tibble: 52 x 4
## State mean sd cv
## <chr> <dbl> <dbl> <dbl>
## 1 AZ 16261. 35984. 2.21
## 2 KS 11863. 20574. 1.73
## 3 WV 2116. 2936. 1.39
## 4 NE 4406. 3886. 0.882
## 5 AK 636. 468. 0.735
## 6 HI 337 227. 0.675
## 7 DC 7995. 5279. 0.660
## 8 WY 543. 307. 0.565
## 9 NY 45696. 25676. 0.562
## 10 IN 17136. 9132. 0.533
## # ... with 42 more rows
The states having CV > 1 will be examined further.
#Total by column for AZ
data.AZ <- filter(datafr, State == "AZ")
ggplot(data = data.AZ, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona")
Column C (pain level 1) for the year 2010 is clearly an outlier. I re-checked the USDA report to rule out an error in data loading; the value is most likely a typographical error in the report itself. On examination, I found that the outlier comes from the All_Other_Covered_Species column. I first replaced this value with NA and computed the mean of that column over the remaining years. I then replaced the NA with the rounded mean and finally updated the Total. The updated plot for AZ is also shown.
#Setting the outlier point to NA
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- NA
#Setting the outlier point to the mean of the column over the years excluding outlier year
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- round(mean(data.AZ[, 2][data.AZ['Column'] == "C"], na.rm = TRUE), digits = 0)
#Updating the Total column for the year
data.AZ[, 'Total'][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- sum(data.AZ[2:12][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010])
#Updated plot with outlier adjusted
ggplot(data = data.AZ, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona (Outlier adjusted)")
The second state having CV > 1 is Kansas. Let’s look at its plot.
#Total by column for KS
data.KS <- filter(datafr, State == "KS")
ggplot(data = data.KS, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Kansas")
The 2019 value at pain level 1 is likely an outlier, but unlike the Arizona case, we cannot be sure, since we do not have data for the years after 2019.
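Suspected outliers like these can also be screened semi-automatically. A minimal base-R sketch using a robust z-score (median and MAD, which are not dragged by the outlier itself); the 3.5 cutoff is a common rule of thumb, not something taken from the USDA data:

```r
# Robust z-score: deviation from the median, scaled by the MAD
robust_z <- function(x) (x - median(x)) / mad(x)

# toy series resembling a stable yearly total with one extreme year
x <- c(12000, 11500, 12800, 11900, 12200, 60000)
which(abs(robust_z(x)) > 3.5)  # flags the 6th value
```

For a state like Kansas, one would apply this to the pain-level-1 totals across years; a flagged value is a candidate for the same verification against the original report done for Arizona.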
Finally, we have West Virginia.
#Total by column for WV
data.WV <- filter(datafr, State == "WV")
ggplot(data = data.WV, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in West Virginia")
WV does not appear to have outliers: all pain levels rise and then fall together over the four years from 2015 to 2019.
Now that we have investigated outliers, let us go back to the Total number of animals vs. Year plots and revise them.
#Propagate the corrected AZ value back into the full data frame
datafr[datafr['State'] == "AZ" & datafr['Column'] == "C" & datafr['Year'] == 2010, ] <- data.AZ[data.AZ['Column'] == "C" & data.AZ['Year'] == 2010, ]
#Recompute the 2010 column C "REPORT TOTAL" row (row 530) as the sum of the 52 state rows above it (rows 478-529)
datafr[530, 2:13] <- colSums(datafr[478:529, 2:13])
data.total <- filter(datafr, State == "REPORT TOTAL")
#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))
ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Total by year and column
total.plot <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
total.plot
data.total1 <- mutate(data.total1, percentage.change = (Total/lag(Total) -1)*100) #Year-over-year percentage change
data.total2 <- mutate(data.total1, percentage.change = 100*(Total - first(Total))/first(Total)) #Cumulative change relative to 2008; same shape as the first plot with a different y axis
#The cumulative change shows an overall reduction of about 24% over the 12 years.
ggplot(data = data.total1, aes(x = Year, y = percentage.change)) + geom_line(na.rm = TRUE) + scale_x_continuous(breaks = c(2008:2019))
Now we observe a more steady decline in the total number of animals over the years.
Next, I want to visualize the patterns in each of the states.
#Faceted plot: one panel per state, colored by pain level
ggplot(data = datafr %>% filter(State != "REPORT TOTAL"), aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 2.5) + ggtitle("Total number of animals used by each state by pain levels") +
scale_x_continuous(breaks = c(2008:2019)) + facet_wrap( ~ State, ncol = 5, scales = "free") + theme(
strip.text.x = element_text(size = 40, color = "blue", face = "bold"),
plot.title = element_text(size = 50, face = "bold", hjust = 0.5),
axis.title = element_blank(),
axis.text = element_text(size = 8),
legend.position = "bottom",
legend.justification = "right",
legend.title = element_text(size = 50),
legend.text = element_text(size = 50),
legend.key.width = unit(x = 3, units = "in")
)
From this grid, we see that of the top 10 states using the highest numbers of animals, 6 show a decreasing trend over the study period.
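The trend direction for each state can be checked programmatically rather than by eye, for instance by fitting a per-state linear model of total on year and inspecting the sign of the slope. A base-R sketch (trend_sign is a hypothetical helper, not part of the analysis above):

```r
# Sign of the OLS slope of total vs. year: -1 indicates a decreasing trend
trend_sign <- function(year, total) {
  sign(coef(lm(total ~ year))[["year"]])
}

# toy example: a steadily declining series over the 2008-2019 window
trend_sign(2008:2019, seq(1200, 100, length.out = 12))  # -1
```

Applied per state (e.g. within a grouped summarise on data.by.state.year), this would tally how many of the top states have a negative slope.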
Next, I repeat the standard deviation analysis by species for the top states, with bar plots broken down by pain level.
#Mean and std dev of each species over the 12 years in CA
data.CA <- filter(datafr, State == "CA") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sum(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total,
                   .fns = list(sd = ~sd(.x, na.rm = TRUE), mean = ~mean(.x, na.rm = TRUE)))) %>%
  pivot_longer(cols = everything(),
               names_to = c("species", ".value"),
               names_pattern = "(.*)_(sd|mean)$") %>%
  arrange(desc(sd))
ggplot(data.CA, aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Standard deviation of species over 12 years in CA") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
data.CA.1 <- filter(datafr, State == "CA") %>%
  select(-State) %>%
  group_by(Year, pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sum(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sd(.x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd")
ggplot(data.CA.1, aes(x = pain.level, y = sd, fill = species)) + geom_bar(stat = "identity", width = 0.5) + ggtitle("Standard deviation of species over 12 years in CA by pain level") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Std dev of species over the 12 years in MA
data.MA <- filter(datafr, State == "MA") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sum(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sd(.x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd))
ggplot(data.MA, aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + ggtitle("Standard deviation of species over 12 years in MA") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
data.MA.1 <- filter(datafr, State == "MA") %>%
  select(-State) %>%
  group_by(Year, pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sum(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sd(.x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd")
ggplot(data.MA.1, aes(x = pain.level, y = sd, fill = species)) + geom_bar(stat = "identity", position = "dodge", width = 0.5) + ggtitle("Standard deviation of species over 12 years in MA by pain level") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Std dev of species over the 12 years in NJ
data.NJ <- filter(datafr, State == "NJ") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sum(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sd(.x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd))
ggplot(data.NJ, aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + ggtitle("Standard deviation of species over 12 years in NJ") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
data.NJ.1 <- filter(datafr, State == "NJ") %>%
  select(-State) %>%
  group_by(Year, pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sum(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~sd(.x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd")
ggplot(data.NJ.1, aes(x = pain.level, y = sd, fill = species)) + geom_bar(stat = "identity", position = "dodge", width = 0.5) + ggtitle("Standard deviation of species over 12 years in NJ by pain level") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
m1 <- lm(Total ~ Year, data = data.total1)
summary(m1)
##
## Call:
## lm(formula = Total ~ Year, data = data.total1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85230 -18560 -5143 25807 52073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54799007 6887537 7.956 0.0000124 ***
## Year -26705 3421 -7.807 0.0000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40910 on 10 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.845
## F-statistic: 60.95 on 1 and 10 DF, p-value: 0.00001458
predict(m1, newdata = data.total1)
## 1 2 3 4 5 6 7 8
## 1176308.3 1149603.7 1122899.2 1096194.7 1069490.1 1042785.6 1016081.1 989376.5
## 9 10 11 12
## 962672.0 935967.5 909262.9 882558.4
data.total.1.predict <- data.frame(year = 2019:2024, round(predict(m1, newdata = tibble(Year = 2019:2024), interval = "prediction"), digits = 0))
data.total.1.predict
## year fit lwr upr
## 1 2019 882558 778845 986272
## 2 2020 855854 748832 962875
## 3 2021 829149 718394 939905
## 4 2022 802445 687570 917320
## 5 2023 775740 656401 895080
## 6 2024 749036 624924 873148
ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line() + geom_smooth(formula = "y~x", method = "lm") + geom_line(data = data.total.1.predict, mapping = aes(x = year, y = fit), linetype = "dashed", color = "blue") + scale_x_continuous(breaks = c(2008:2024)) + scale_y_continuous(breaks = seq(700000, 1300000, by = 100000), labels = seq(7, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
m.quad <- lm(Total ~ Year+Year.sq, data = mutate(data.total1, Year.sq = Year*Year))
summary(m.quad)
##
## Call:
## lm(formula = Total ~ Year + Year.sq, data = mutate(data.total1,
## Year.sq = Year * Year))
##
## Residuals:
## Min 1Q Median 3Q Max
## -80005 -24414 -3277 28059 54767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2245798614.7 4728853108.7 0.475 0.646
## Year -2203020.4 4697157.1 -0.469 0.650
## Year.sq 540.4 1166.4 0.463 0.654
##
## Residual standard error: 42610 on 9 degrees of freedom
## Multiple R-squared: 0.8623, Adjusted R-squared: 0.8317
## F-statistic: 28.19 on 2 and 9 DF, p-value: 0.0001333
#Follow-up: rescale Year to 0-11 so the quadratic coefficients are on an interpretable scale
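Rescaling the year, as the note above suggests, is a pure reparameterization: the fitted values are unchanged, but the coefficients become numerically stable and interpretable (intercept = estimated 2008 total). A self-contained sketch in which toy totals stand in for data.total1:

```r
# Demonstration that shifting Year to start at 0 changes only the
# coefficients of the quadratic model, not its fitted values
d <- data.frame(Year = 2008:2019,
                Total = seq(1176308, 882558, length.out = 12) + rnorm(12, sd = 1000))
m.raw    <- lm(Total ~ Year + I(Year^2), data = d)
m.scaled <- lm(Total ~ I(Year - 2008) + I((Year - 2008)^2), data = d)
all.equal(unname(fitted(m.raw)), unname(fitted(m.scaled)), tolerance = 1e-6)  # TRUE
```

In the analysis above, the same refit would be lm(Total ~ I(Year - 2008) + I((Year - 2008)^2), data = data.total1), avoiding the billion-scale intercept seen in the m.quad summary.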